The fishHook R package enables agile statistical exploration of coding and non-coding mutational recurrence in cancer through generalized linear modeling (GLM) of heterogenous neutral somatic mutation rates along the genome. fishHook can be applied to the analysis of genes, enhancers, promoters, genomic tiles, or any arbitrary “hypothesis set” (defined by the user as a set of genomic intervals or GRanges), as well as complex sets of the above (e.g. genes sets representing pathways, enhancers sets known to interact with a target gene). The core of the fishHook approach is to enable nomination of loci following the correction of known covariates of neutral mutation, including chromatin state, replication timing, and nucleotide context. The goal of fishHook is to identify cancer “drivers”, i.e. loci that are under positive somatic selection through the application of an accurate null / background model whose application yields straight Q-Q plots.
Though we provide many pre-computed covariates, the key power of fishHook is to empower the user to add custom covariates and (more broadly) provide a framework for generation of custom models, e.g. fitting variant- and tumor-type specific linear models and then integrating deviations from these models to nominate loci (or sets of loci) that deviate from the NULL.
Please include this citation if you use fishHook: Imielinski, Guo, Meyerson Cell. 2017 Jan 26;168(3):460-472
We will demonstrate a quick whole exome analysis using public TCGA lung adenocarcinoma mutation data. Additional packages like gTrack and rtracklayer will help with data import and visualization, but are not necessary to run fishHook.
library(fishHook)
library(gTrack)
library(plotly)
library(rtracklayer)
Read in the mutation data and additional tracks that we will use in this analysis.
## mutation calls cached from public GDAC Broad firehose https://bit.ly/2sFxWY6
mutations = dt2gr(fread('http://mskilab.com/fishHook/hg19/luad.maf'))
## GENCODE v19 genes these are our "hypotheses"
genes = gr.sub(import('http://mskilab.com/fishHook/hg19/gencode.v19.genes.gtf')) %Q% (gene_type == 'protein_coding') %Q% (level<3)
## protein coding cds definitions
cds = readRDS(gzcon(file('http://mskilab.com/fishHook/hg19/gencode.v19.cds.rds')))
## bigWig file of fractional coverage of hg19 positions by Agilent exome
## we will use this in combination with cds to define eligible positions
exomecov = import('http://mskilab.com/fishHook/hg19/exome_coverage.bw')
Take a peek at our mutations
head(mutations[, c('Tumor_Sample_Barcode', 'Variant_Type', 'Variant_Classification', 'Reference_Allele', 'Tumor_Seq_Allele2')])
## GRanges object with 6 ranges and 5 metadata columns:
## seqnames ranges strand | Tumor_Sample_Barcode
## <Rle> <IRanges> <Rle> | <character>
## [1] 1 905907 * | TCGA-05-4249-01A-01D-1105-08
## [2] 1 1192480 * | TCGA-05-4249-01A-01D-1105-08
## [3] 1 1854885 * | TCGA-05-4249-01A-01D-1105-08
## [4] 1 9713992 * | TCGA-05-4249-01A-01D-1105-08
## [5] 1 12908093 * | TCGA-05-4249-01A-01D-1105-08
## [6] 1 17257855 * | TCGA-05-4249-01A-01D-1105-08
## Variant_Type Variant_Classification Reference_Allele
## <character> <character> <character>
## [1] SNP Missense_Mutation A
## [2] SNP Silent C
## [3] SNP IGR G
## [4] SNP Intron G
## [5] SNP Missense_Mutation C
## [6] SNP Missense_Mutation C
## Tumor_Seq_Allele2
## <character>
## [1] T
## [2] A
## [3] C
## [4] A
## [5] A
## [6] T
## -------
## seqinfo: 25 sequences from an unspecified genome
First we define the “eligible territory”. This is a key component of all somatic mutational recurrence analyses, since not all of the genome is covered by most sequencing assays. For example in a whole exome sequencing dataset, less than 2% of the genome is reliably captured. In a targeted sequencing panel, this fraction will be even smaller. Even in whole genome sequencing using Illumina short reads, only a subset (70%) of the genome is reliably callable (Heng Li Bioinformatics 2014 Oct 15;30(20):2843–51)
Eligible territory coverage will influence the “denominator” of our recurrence analysis, i.e. the number of positions in each hypothesis interval where a mutation could have possibly been detected. If we do not take eligible territory into account we will mis-estimate the background mutation rate in a given region.
To define eligible territory for this whole exome analysis, we will choose the portion of cds (protein coding) bases that are captured in at least 95% of whole exomes, which represents about 26MB of genome.
eligible = reduce(intersect(exomecov %Q% which(score>0.95), cds, ignore.strand = TRUE))
We define “events” as nonsynonymous mutations. In this simple model, we will lump together SNVs (of different flavors), and indels (of different flavors). (We discuss more complex models that subdivide mutation types later in the tutorial).
events = mutations %Q% (Variant_Classification != 'Silent')
Now that we have loaded our hypotheses (genes), events, and eligible, we are ready to create and analyze a basic FishHook object. Instantiation of the object will involve counting how many events are in the eligible portion of each hypothesis interval. We provide the “idcol” parameter so that each tumor sample (as defined by the “Tumor_Sample_Barcode” column in this events GRanges) can only provide at most one event to the counts in each interval.
fish = Fish(hypotheses = genes[, 'gene_name'], events = events, eligible = eligible, idcol = 'Tumor_Sample_Barcode')
## Contains 19796 hypotheses.
## Contains 56809 events to map to hypotheses.
## Spanning 24.58 MB of eligible territory.
## Covariates:
##
## Hypotheses contains 1 metadata columns.
## Current State: Annotated
We can score this basic FishHook object, i.e. compute p values for every hypothesis, using a simple glm that models a uniform mutation density along the genome, i.e. the glm fits only an intercept and applies no covariates (after correcting for the number of eligible bases in each interval).
The $res field of the FishHook contains a data.table of scoring results. $res has one row per input hypothesis, with p values, FDRs, and additional interval annotations provided with the hypotheses GRanges.
fish$score()
head(fish$res[order(p), ])
fish$qqp()
| seqnames | start | end | width | strand | gene_name | p | fdr | count | effectsize | count.pred | count.density | count.pred.density | p.neg | fdr.neg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 17 | 7565097 | 7590856 | 25760 |
|
TP53 | 0.0e+00 | 0.0000 | 106 | 5.62 | 2.15 | 0.0941 | 0.00191 | 1 | 1 |
| 2 | 79384132 | 79386879 | 2748 |
|
REG3A | 1.0e-07 | 0.0009 | 17 | 4.09 | 1.00 | 0.0324 | 0.00191 | 1 | 1 |
| 19 | 10596796 | 10614417 | 17622 |
|
KEAP1 | 5.0e-07 | 0.0031 | 30 | 3.54 | 2.57 | 0.0222 | 0.00191 | 1 | 1 |
| 5 | 24487209 | 24645087 | 157879 |
|
CDH10 | 3.6e-06 | 0.0180 | 41 | 3.19 | 4.50 | 0.0174 | 0.00191 | 1 | 1 |
| 3 | 147103833 | 147124647 | 20815 |
|
ZIC4 | 5.5e-06 | 0.0200 | 18 | 3.41 | 1.69 | 0.0203 | 0.00191 | 1 | 1 |
| 8 | 88882973 | 88886296 | 3324 |
|
DCAF4L2 | 6.4e-06 | 0.0200 | 22 | 3.28 | 2.26 | 0.0186 | 0.00191 | 1 | 1 |
You will notice that this Q-Q plot appears curved and inflated, though its slope (lambda) is reasonably near 1. The low alpha value (MLE of the alpha parameter of the Gamma distribution), suggests that the GLM is detecting additional variance in the data that is unmodeled by a pure Poisson regression. Adding covariates to the model should improve the quality of the fit.
The top hits in the plot (you can hover over them) include TP53 but also many unlikely cancer gene candidates. Among these are olfactory receptors, which are located in late replicating regions of the human genome and thus accumulate neutral mutations more frequently (Lawrence et al 2013 Nature Jul 11;499(7457):214-218).
To address these issues, we will load in data specifying replication timing, chromatin state, and nucleotide context. These are all important determinants of somatic neutral mutation density. We load these data in as GRanges objects using functions in data.table, rtracklayer, and gUtils packages (however you are free to use any GRanges import utility of your choice).
We first load in replication timing data as a GRanges then instantiate it as a Covariate Replication timing information is contained in the $score metadata field of reptime. We instantiate it as a covariate of type “numeric” by specifying field score.
## replication timing for NHEK obtained from https://bit.ly/2sRsXT9 and
## converted to rds via rtracklayer::import
reptimedata = readRDS(gzcon(file('http://mskilab.com/fishHook/hg19/RT_NHEK_Keratinocytes_Int92817591_hg19.rds')))
reptime = Cov(data = reptimedata, field = 'score')
Below, context is a GRanges object with 98 columns representing tri, di, and mononucleotide context counts in the hg19 genome. Code for computing context (e.g. for another genome) is provided here.
We instantiate a numeric Covariate object from context, choosing only two of the columns here to take into account G and C content. Note that the covariate object can be vectorized (concatenated, subsetted) and instantiated around several columns of an input GRanges. As a result, contextcov will be length 2 (representing G and C nucleotides fraction).
context = readRDS(gzcon(file('http://mskilab.com/fishHook/hg19/nucleotide.context.rds')))
gc = Cov(data = context, field = c('C', 'G'))
Finally we load in chromHMM data for cell line A549 from Epigenomics Roadmap. We will want to create a covariate that will model the fraction of heterochromatic and quiescent regions in each query interval.
To do so, we will create an “interval” covariate by not specifying a metadata field.
### data cached from https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final/E114_15_coreMarks_mnemonics.bed.gz
hetchromdata = gr.sub(import('http://mskilab.com/fishHook/hg19/E114_15_coreMarks_mnemonics.bed.gz'), 'chr', '') %Q% (name %in% c('8_ZNF/Rpts', '9_Het', '15_Quies'))
hetchrom = Cov(hetchromdata, name = 'Heterochromatin')
We now add these covariates to the model. For type numeric covariates, e.g. replication timing, this will trigger the calculation of the average value of each covariate within the eligible subset of each hypothesis interval. and the fractional overlap of of its eligible subset This annotation is the most computationally intensive and slowest aspect of fishHook analyses, though occurs within a few seconds for this small number of covariates.
fish$covariates = c(reptime, gc, hetchrom)
## FishHook 2018-06-11 19:49:06: Aggregating covariates over eligible subset of hypotheses
## FishHook 2018-06-11 19:49:06: Overlapping with covered intervals
## FishHook 2018-06-11 19:49:06: Finished overlapping with covered intervals
## FishHook 2018-06-11 19:49:06: Annotating track score
## FishHook 2018-06-11 19:49:11: Annotating track C
## FishHook 2018-06-11 19:49:13: Annotating track G
## FishHook 2018-06-11 19:49:16: Annotating track Heterochromatin
## Warning in self$clear(): Resetting scores since covariates re-defined,
## please run $score() to get updated p values
Now that we’ve added covariates, we can rescore fish and compute p values. Looking at these results, we immediately see an improvement in lambda (closer to one) and alpha (increased), and the nominated gene list (no more olfactory receptors, We see also a reasonable number of significant (fdr<0.1) genes at the top of the list (or at the top right of the QQ plot) that have been biologically implicated in lung adenocarcinoma tumorigenesis.
fish$score()
fish$res[order(p)[fdr<0.1, ]
fish$qqp()
| seqnames | start | end | width | strand | gene_name | p | fdr | count | effectsize | count.pred | count.density | count.pred.density | p.neg | fdr.neg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 17 | 7565097 | 7590856 | 25760 |
|
TP53 | 0.0e+00 | 0.00000 | 106 | 6.27 | 1.370 | 0.09410 | 0.00121 | 1 | 1 |
| 19 | 10596796 | 10614417 | 17622 |
|
KEAP1 | 0.0e+00 | 0.00000 | 30 | 4.02 | 1.840 | 0.02220 | 0.00137 | 1 | 1 |
| 7 | 55086714 | 55324313 | 237600 |
|
EGFR | 1.0e-07 | 0.00054 | 44 | 3.06 | 5.280 | 0.01200 | 0.00144 | 1 | 1 |
| 7 | 142457319 | 142460923 | 3605 |
|
PRSS1 | 6.9e-06 | 0.03400 | 16 | 3.06 | 1.910 | 0.02160 | 0.00258 | 1 | 1 |
| 1 | 157800704 | 157868046 | 67343 |
|
CD5L | 8.7e-06 | 0.03400 | 14 | 3.07 | 1.670 | 0.01340 | 0.00161 | 1 | 1 |
| 21 | 44513066 | 44527697 | 14632 |
|
U2AF1 | 1.8e-05 | 0.05400 | 8 | 3.38 | 0.768 | 0.01180 | 0.00113 | 1 | 1 |
| 14 | 19553365 | 19590078 | 36714 |
|
POTEG | 2.0e-05 | 0.05400 | 13 | 2.99 | 1.640 | 0.02210 | 0.00279 | 1 | 1 |
| 11 | 62443970 | 62446567 | 2598 |
|
UBXN1 | 2.2e-05 | 0.05400 | 10 | 3.28 | 1.030 | 0.01140 | 0.00118 | 1 | 1 |
| 1 | 175284330 | 175712906 | 428577 |
|
TNR | 2.8e-05 | 0.05700 | 38 | 2.49 | 6.740 | 0.00933 | 0.00166 | 1 | 1 |
| 6 | 136578001 | 136610989 | 32989 |
|
BCLAF1 | 2.9e-05 | 0.05700 | 23 | 2.66 | 3.650 | 0.00834 | 0.00132 | 1 | 1 |
| 7 | 140419127 | 140624564 | 205438 |
|
BRAF | 3.6e-05 | 0.06500 | 21 | 2.66 | 3.320 | 0.00984 | 0.00156 | 1 | 1 |
| 8 | 10463859 | 10569697 | 105839 |
|
RP1L1 | 4.6e-05 | 0.07600 | 48 | 2.34 | 9.450 | 0.00791 | 0.00156 | 1 | 1 |
| 19 | 10416103 | 10426691 | 10589 |
|
FDX1L | 5.3e-05 | 0.08100 | 4 | 4.02 | 0.247 | 0.01650 | 0.00102 | 1 | 1 |
Note that this model is still quite rudimentary (e.g. we have lumped together SNVs and indels, we have not substratified SNVs by mutational context, we have employed very few covariates) but we still get a reasonable gene list and minimal statistical inflation.
We can inspect the parameters of this model see which features it’s using. We can see from the estimate and Pr(|>z|) that replication timing is significantly negatively correlated and heterochromatin is significantly positively correlated with mutational density (as expected).
summary(fish$model)
##
## Call:
## glm.nb(formula = formula, data = as.data.frame(tdt), maxit = iter,
## init.theta = 3.514844957, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1965 -0.9862 -0.2834 0.4381 15.5196
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.63330 0.04891 -135.631 < 2e-16 ***
## score -0.51553 0.01023 -50.411 < 2e-16 ***
## C 0.70117 0.41074 1.707 0.087801 .
## G 1.47342 0.41306 3.567 0.000361 ***
## Heterochromatin 0.33214 0.01672 19.860 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(3.5148) family taken to be 1)
##
## Null deviance: 23853 on 18186 degrees of freedom
## Residual deviance: 19088 on 18182 degrees of freedom
## AIC: 65310
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 3.5148
## Std. Err.: 0.0935
##
## 2 x log-likelihood: -65298.0760
We can run a similar analysis but choosing only truncating mutations (by subsetting the mutation GRanges). Scoring this new model, we obtain a different mutation list, containing likely candidate drivers with enrichment of frameshift, nonsense, or nonstop indels and SNVs. This includes genes like ARID1A, RBM10, and SETD2.
## replace events with new subset of mutations
fish$events = mutations %Q%
(grepl('(Frame_Shift_)|(Nonsense)|(OutOfFrame)|(Nonstop)', Variant_Classification))
## re-score model and inspect results
fish$score()
head(fish$res[order(p), ])
fish$qqp()
| seqnames | start | end | width | strand | gene_name | p | fdr | count | effectsize | count.pred | count.density | count.pred.density | p.neg | fdr.neg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 17 | 7565097 | 7590856 | 25760 |
|
TP53 | 0.0e+00 | 0.00000 | 33 | 7.72 | 0.1570 | 0.02930 | 0.000139 | 1 | 1 |
| 3 | 47057919 | 47205457 | 147539 |
|
SETD2 | 0.0e+00 | 0.00014 | 18 | 4.23 | 0.9620 | 0.00295 | 0.000157 | 1 | 1 |
| X | 47004268 | 47046212 | 41945 |
|
RBM10 | 1.0e-07 | 0.00038 | 6 | 5.52 | 0.1310 | 0.01140 | 0.000249 | 1 | 1 |
| 16 | 3115298 | 3131908 | 16611 |
|
IL32 | 1.2e-06 | 0.00590 | 4 | 5.61 | 0.0820 | 0.00905 | 0.000186 | 1 | 1 |
| 8 | 126442563 | 126450647 | 8085 |
|
TRIB1 | 3.7e-06 | 0.01500 | 4 | 5.48 | 0.0898 | 0.00531 | 0.000119 | 1 | 1 |
| 1 | 27022524 | 27108595 | 86072 |
|
ARID1A | 4.8e-06 | 0.01600 | 10 | 3.84 | 0.6980 | 0.00176 | 0.000123 | 1 | 1 |
Read in and parse reactome pathways as list of gene symbols.
## parse Reactome pathways from .gmt format
pathways = strsplit(readLines('http://mskilab.com/fishHook/hg19/ReactomePathways.gmt'), '\t')
pathways = structure(lapply(pathways, '[', -1), names = sapply(pathways, '[', 1))
## match them to create sets of indices as a named list
sets = sapply(sapply(pathways, match, fish$hypotheses$gene_name), setdiff, NA)
Here is what the pathways and sets look like:
head(pathways[1:2])
## $`2-LTR circle formation`
## [1] "R-HSA-164843" "BANF1" "HMGA1" "LIG4"
## [5] "PSIP1" "XRCC4" "XRCC5" "XRCC6"
## [9] "gag" "gag-pol" "rev" "vif"
## [13] "vpr" "vpu"
##
## $`5-Phosphoribose 1-diphosphate biosynthesis`
## [1] "R-HSA-73843" "PRPS1" "PRPS1L1" "PRPS2"
head(sets[1:2])
## $`2-LTR circle formation`
## [1] 10815 6342 12670 8606 5385 3035 18801
##
## $`5-Phosphoribose 1-diphosphate biosynthesis`
## [1] 19432 7085 18957
The set analysis is a bit more computationally intensive. We can speed things up through parallelization (setting fish$mc.cores = 5). When we set the sets variable to a particular value, this will trigger automatic scoring of hypothesis sets (in this case gene sets). The results are shown in $setres variable.
| setname | p | fdr | effect | estimate | p.left | p.twosided |
|---|---|---|---|---|---|---|
| TP53 Regulates Transcription of Genes Involved in G1 Cell Cycle Arrest | 1.11e-29 | 0 | 16.5 [10.1-27] | 16.5 | 1 | 0 |
| RUNX3 regulates CDKN1A transcription | 1.24e-29 | 0 | 30.6 [16.9-55.7] | 30.6 | 1 | 0 |
| Formation of Senescence-Associated Heterochromatin Foci (SAHF) | 2.78e-29 | 0 | 14.2 [8.89-22.5] | 14.2 | 1 | 0 |
| Activation of PUMA and translocation to mitochondria | 2.93e-29 | 0 | 26.2 [14.8-46.4] | 26.2 | 1 | 0 |
| TP53 Regulates Transcription of Caspase Activators and Caspases | 3.67e-29 | 0 | 16.9 [10.3-27.8] | 16.9 | 1 | 0 |
Examining the results table, we can see that most of the significant pathways appear related to TP53. Indeed, if we inspect the hypotheses (i.e. genes) contributing to these these gene sets, we will see that they are dominated by TP53 and 1-2 additional genes. For example:
fish$res[sets[[fish$setres[order(p), ][fdr<0.1, ][3, setname]]], ][order(p), ]
| seqnames | start | end | width | strand | gene_name | p | fdr | count | effectsize | count.pred | count.density | count.pred.density | p.neg | fdr.neg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 17 | 7565097 | 7590856 | 25760 |
|
TP53 | 6.9e-38 | 0.00 | 33 | 7.72 | 0.1570 | 0.029300 | 0.000139 | 1.00 | 1 |
| 13 | 48877887 | 49056122 | 178236 |
|
RB1 | 0.00016 | 0.24 | 6 | 3.58 | 0.5030 | 0.002180 | 0.000183 | 1.00 | 1 |
| 5 | 126112315 | 126172712 | 60398 |
|
LMNB1 | 0.061 | 1.00 | 1 | 1.59 | 0.3320 | 0.000768 | 0.000255 | 0.95 | 1 |
| 12 | 66217911 | 66360075 | 142165 |
|
HMGA2 | 0.081 | 1.00 | 0 | -Inf | 0.0275 | 0.000000 | 0.000137 | 0.97 | 1 |
| 6 | 26017260 | 26018040 | 781 |
|
HIST1H1A | 0.24 | 1.00 | 0 | -Inf | 0.2390 | 0.000000 | 0.000370 | 0.80 | 1 |
| 6 | 27834570 | 27835359 | 790 |
|
HIST1H1B | 0.29 | 1.00 | 0 | -Inf | 0.2020 | 0.000000 | 0.000311 | 0.82 | 1 |
| 6 | 36644305 | 36655116 | 10812 |
|
CDKN1A | 0.46 | 1.00 | 0 | -Inf | 0.0543 | 0.000000 | 0.000140 | 0.95 | 1 |
| 16 | 4896666 | 4932361 | 35696 |
|
UBN1 | 0.48 | 1.00 | 0 | -Inf | 0.4690 | 0.000000 | 0.000138 | 0.65 | 1 |
| 22 | 24407642 | 24574596 | 166955 |
|
CABIN1 | 0.57 | 1.00 | 0 | -Inf | 0.7160 | 0.000000 | 0.000140 | 0.54 | 1 |
| 6 | 34204650 | 34214008 | 9359 |
|
HMGA1 | 0.59 | 1.00 | 0 | -Inf | 0.0186 | 0.000000 | 0.000132 | 0.98 | 1 |
| 6 | 26234440 | 26235216 | 777 |
|
HIST1H1D | 0.67 | 1.00 | 0 | -Inf | 0.1890 | 0.000000 | 0.000285 | 0.83 | 1 |
| 22 | 19318221 | 19435224 | 117004 |
|
HIRA | 0.69 | 1.00 | 0 | -Inf | 0.5410 | 0.000000 | 0.000163 | 0.62 | 1 |
| 22 | 38201114 | 38203442 | 2329 |
|
H1F0 | 0.71 | 1.00 | 0 | -Inf | 0.0719 | 0.000000 | 0.000132 | 0.93 | 1 |
| 6 | 26156559 | 26157343 | 785 |
|
HIST1H1E | 0.76 | 1.00 | 0 | -Inf | 0.1260 | 0.000000 | 0.000355 | 0.88 | 1 |
| 6 | 26055968 | 26056699 | 732 |
|
HIST1H1C | 0.88 | 1.00 | 0 | -Inf | 0.2010 | 0.000000 | 0.000314 | 0.83 | 1 |
| 6 | 119215384 | 119230332 | 14949 |
|
ASF1A | 0.97 | 1.00 | 0 | -Inf | 0.1170 | 0.000000 | 0.000275 | 0.89 | 1 |
This is a common challenge with pathway analysis of mutations, since many cancer pathways are usually driven by a single “celebrity” gene.